(57) Abstract

The disclosed system translates into different languages HTML documents (16)
available through the World Wide Web. HTML documents (16) are translated by
machine translation software (10) bundled in a browser (12). Alternatively,
documents are retrieved as needed, translated, and stored on a Web server so
user requests are serviced with a document that has been translated from a
different language. The disclosed invention expands usage of the Internet for
non-English speakers.







WO 97/18516



PCT/US96/18102



INTEGRATED MULTILINGUAL BROWSER

BACKGROUND AND SUMMARY OF THE INVENTION
The present invention relates generally to the field of electronic communication over a
computer network. Particularly, the present invention relates to the expansion of multi-lingual
electronic communication through translation services for documents and messages available
through the Internet.

The recent surge in media attention to the Internet, and especially the World Wide
Web, coupled with the continuing growth in home PC ownership, has resulted in a growing
diversity of the Internet user population. No longer is the Internet the province of software
experts; thousands of novice users have begun to come online each day. Software like
CompuServe's Web Browser lets users quickly connect to and find useful content online. This
phenomenon is not restricted to the United States or to English-speaking countries. Growth
in online usage in Europe and Asia is increasing even more quickly than in the U.S.

While interest in the online world is at a peak, a significant obstacle exists to broad
usage of the Internet for non-English speakers. The vast majority of Internet content is in
English, and is therefore inaccessible to users with other native languages. Translation of
Internet documents by a human translator is not a practical solution for two reasons. First,
human translation is costly and slow. A translator can typically produce 300-400 words per
hour at costs of 12¢ per word or more. Second, in order to have a translator convert Internet
documents to the user's native language, the user would have to download every document he
was interested in to provide it to the translator. This is a time-consuming process, and if the
user knows no English, he will not even be able to assess the relevance of the document before

Figure 9 is a diagrammatic view of one embodiment of the present invention in which
pre-translated Web pages are accessible from a server.



DETAILED DESCRIPTION OF PREFERRED EMBODIMENT(S)
Although the detailed description of a preferred embodiment focuses on automatic
translation of World Wide Web pages, the concept is adaptable to documents obtained from
other sources.

The World Wide Web (WWW or the Web) is a distributed information system that
may be accessed through a number of sources. It is comprised of software and a set of
protocols and conventions. Information on the Web may be accessed using a browser
program such as CompuServe's Web Browser. Browsers allow users to read documents and
to locate documents from other sources. They present an interface for interacting with the
system and they process requests on behalf of the user.

Information providers on the WWW make their information available through
programs that understand the HyperText Transfer Protocol (HTTP). Browsers assist users in
"visiting" Web sites where information is stored. Information is displayed in pages of text and
graphics called "Web Pages." An example of a Web page as viewed through CompuServe's
Web Browser is provided in Figures 1A and 1B. The Web page shown in Figures 1A and 1B
contains both text 14, 18 and graphics 10, 12, 16. The title bar 20, menu options 22, buttons
24, and document information 26 appearing at the top of the screen are part of the browser
used to view the Web page.

In most cases, information providers make information available through a Web server.
The server responds to information requests by delivering the requested information to the
user's browser for viewing. Some providers may make their information available through a
Figure 2 is the hypertext document that describes the Web page shown in Figures 1A
and 1B. Figure 2 shows the markups and related words (that comprise codes) as well as data
characters that may appear in a hypertext document. For example, the characters "<" and ">"
appearing throughout the document are markups. The characters "<" and ">" combined with
the word "head" ("<head>") 10 may be considered a code. Finally, the text "NLT Home" 10
that is not surrounded by markups or codes may be considered data characters.
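
The distinction between codes and data characters can be illustrated with a minimal
Python sketch. It is not the implementation disclosed here; the function name classify
and the simplifying assumption that every code runs from "<" to the next ">" are
hypothetical and chosen only for illustration.

```python
import re

# Assumption: a code is anything from "<" up to the next ">".
# Real HTML (comments, quoted ">" inside attributes) needs a fuller parser.
TAG = re.compile(r"<[^>]*>")


def classify(stream):
    """Split a character stream into ('code', text) and ('data', text) pairs."""
    parts = []
    pos = 0
    for match in TAG.finditer(stream):
        if match.start() > pos:
            # Characters between two codes are data characters.
            parts.append(("data", stream[pos:match.start()]))
        parts.append(("code", match.group()))
        pos = match.end()
    if pos < len(stream):
        # Trailing data characters after the last code.
        parts.append(("data", stream[pos:]))
    return parts


if __name__ == "__main__":
    for kind, text in classify("<head><title>NLT Home</title></head>"):
        print(kind, repr(text))
```

Applied to the title line of Figure 2, the sketch reports four codes and a single run
of data characters, "NLT Home".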

As indicated by the brief description, HTML documents have a well-defined and
documented structure defined by a grammar. The codes in a HTML document convey
important information regarding both the display or presentation of the document itself as well
as related references and commands. Display and presentation information may include color
information, information about graphics that appear on the page, information about text that
appears on the page, etc. A HTML document is structured as a series of elements that are
identified by the language markups and codes. A document includes a head (consisting of a
title and other optional elements) and a body that is a text flow of paragraphs, lists, images,
and other elements. The various parts of the document may be identified by looking at the
markups or codes in the document. For example, referring again to Figure 2 which shows the
hypertext for Figures 1A and 1B, the document head contains the title "NLT Home" 10. An
image contained in the document is identified in the line

"<br><center><img src="file:///n|/iowebsrv/server/8100-1.1/server-1/image/ntl.jpg" height=60
width=640></center>" 12.

As may be apparent, the process of translating a HTML document requires
examination of each character in the document. Characters may be examined individually and in
combination to determine whether they are markups, codes, or data characters. To process a
document, the processing software examines the character stream that comprises the
Figure 2. In this example, the boundary markers used to identify the HTML codes are the
character pairs "{." and ".}". Any character or character combination that does not normally
occur in text may be used as a boundary marker. The line that appeared as
"<head><title>NLT Home</title></head>" in Figure 2 (10) is preprocessed in Step 1 to the line
"{.<head>.}{.<title>.}NLT Home{.</title>.}{.</head>.}" in Figure 3 (10). Other lines in the
document are preprocessed similarly.
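
The Step 1 preprocessing described above can be sketched in a few lines of Python.
This is an illustrative sketch only: the function name preprocess and the regular
expression used to find codes are assumptions, and the marker pair "{." and ".}" is
simply the one used in the example.

```python
import re

# Boundary markers taken from the example; any sequence that does not
# normally occur in text could be substituted.
OPEN_MARK = "{."
CLOSE_MARK = ".}"

# Assumption: a code is anything from "<" up to the next ">".
TAG = re.compile(r"<[^>]*>")


def preprocess(html):
    """Surround every HTML code with boundary markers, leaving data untouched."""
    return TAG.sub(lambda m: OPEN_MARK + m.group() + CLOSE_MARK, html)


if __name__ == "__main__":
    print(preprocess("<head><title>NLT Home</title></head>"))
    # prints {.<head>.}{.<title>.}NLT Home{.</title>.}{.</head>.}
```

Wrapping rather than deleting the codes keeps the document in a single stream, so the
original layout can be restored once translation is finished.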

Step 2. Machine Translation (MT) software performs the translation of text from one
language to another language. There are many commercially available MT software packages.
Figure 4 is an illustration of a system in which MT software 10 takes as input text in one
language 12 and generates a rough draft translation of the text in another language 14 using an
electronic dictionary 16 and a set of linguistic and/or statistical rules encoded in the program
18. MT software can perform language conversion operations very quickly: in some cases, at
speeds of up to 3,000 words per minute. The translated texts are not high quality translations,
but they are usually adequate for understanding what the document is about.

Referring to Figures 5A and 5B, an example of a translated HTML document is
shown. The HTML document of Figures 5A and 5B is the translated version of the
preprocessed HTML document shown in Figure 3. As described above, the boundary markers
used to identify the HTML codes are the character pairs "{." and ".}". Consequently, the MT
software ignores all text that falls between the boundary markers. Data characters that are not
surrounded by boundary markers are translated by the MT software. The preprocessed line
that appeared as "{.<head>.}{.<title>.}NLT Home{.</title>.}{.</head>.}" in Figure 3 (10) is
translated in Step 2 to the line "{.<head>.}{.<title>.}NLT Maison{.</title>.}{.</head>.}" in
Figure 5A (10).
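
The way the MT software passes over marked codes in Step 2 can be sketched as follows.
The one-entry dictionary and the word-for-word toy_translate function are hypothetical
stand-ins for a real MT package, which this sketch does not attempt to model; only the
skipping of text between "{." and ".}" reflects the description above.

```python
import re

# One protected span: "{." followed by anything up to the nearest ".}".
PROTECTED = re.compile(r"(\{\..*?\.\})", re.DOTALL)

# Hypothetical one-entry dictionary standing in for the MT dictionary.
TOY_DICTIONARY = {"Home": "Maison"}


def toy_translate(text):
    """Word-for-word lookup; words missing from the dictionary pass through."""
    return " ".join(TOY_DICTIONARY.get(word, word) for word in text.split(" "))


def translate_marked(marked):
    """Translate only the text that lies outside the boundary markers."""
    pieces = PROTECTED.split(marked)
    out = []
    for index, piece in enumerate(pieces):
        # With a capturing group, re.split puts protected spans at odd indices.
        if index % 2 == 1:
            out.append(piece)
        else:
            out.append(toy_translate(piece))
    return "".join(out)


if __name__ == "__main__":
    line = "{.<head>.}{.<title>.}NLT Home{.</title>.}{.</head>.}"
    print(translate_marked(line))
    # prints {.<head>.}{.<title>.}NLT Maison{.</title>.}{.</head>.}
```

Because the codes are copied through unchanged, a word such as "head" or "title" that
happens to appear inside a code is never offered to the translator.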
that delivery time to the user may be reduced. Although storing documents requires disk
space, it may represent a better use of system resources because documents that are accessed
frequently are translated once rather than every time they are accessed.

Figure 9 is a diagrammatic view of an alternative implementation in which
pre-translated Web pages are stored on a Web server 14. The translation software resides on a
translation server 14 (possibly the same machine as the Web server). Popular Web pages 24
are pre-translated and stored in a cache 28, with additional pages being added as they are
requested by users 20. The cache is a dynamic storage device with a finite capacity. New,
pretranslated pages are added to the cache, but pages will also be removed from the cache if
they are used infrequently or if there are constraints on storage capacity.

In accessing the system, a user 10 sends to the Web Server 14 a request for a specific
page in a specific language 12. The Web Server 14 then sends a request to get the desired
page 16. The method for servicing the request depends on where the page is located. If the
page has been pre-translated 24 and stored in the cache of pages in multiple languages 28, it is
retrieved from the cache 26 and returned to the user in the requested language 30. If the page
has not been pre-translated, then the page is retrieved 20 from the World Wide Web 22,
translated into the requested language, and cached before being sent to the user 30.
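
The request flow just described can be sketched as a small cache keyed by page and
language. This is a hedged illustration, not the disclosed server: the class name
TranslationCache, the fetch and translate callables, and the least-recently-used
eviction rule are all assumptions. The description says only that infrequently used
pages are removed or that storage constraints apply, so another policy could be
substituted.

```python
from collections import OrderedDict


class TranslationCache:
    """Finite cache of translated pages keyed by (url, language)."""

    def __init__(self, capacity, fetch, translate):
        self.capacity = capacity      # maximum number of cached pages
        self.fetch = fetch            # hypothetical: url -> original page text
        self.translate = translate    # hypothetical: (text, language) -> text
        self.pages = OrderedDict()

    def get(self, url, language):
        key = (url, language)
        if key in self.pages:
            # Pre-translated page: serve it and mark it as recently used.
            self.pages.move_to_end(key)
            return self.pages[key]
        # Not cached: retrieve, translate, and cache before returning.
        translated = self.translate(self.fetch(url), language)
        self.pages[key] = translated
        if len(self.pages) > self.capacity:
            # Assumed policy: evict the least recently used page.
            self.pages.popitem(last=False)
        return translated


if __name__ == "__main__":
    cache = TranslationCache(
        capacity=2,
        fetch=lambda url: "NLT Home",
        translate=lambda text, language: text + " [" + language + "]",
    )
    print(cache.get("nlt.htm", "fr"))  # fetched and translated
    print(cache.get("nlt.htm", "fr"))  # served from the cache
```

Keying on both the page and the language lets the same page be cached separately for
each requested language, matching the "cache of pages in multiple languages" above.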

Translation of Web pages, in either the bundled browser/MT configuration or the Web
Server configuration, requires processing of HTML codes containing reference, command,
and display information. Preferably, the HTML codes are identified prior to translation, then
surrounded by special boundary markers to block the translation process on the codes. The
HTML preprocessor uses its knowledge regarding the markups, codes, data characters and the
structure of HTML documents to determine which codes should be blocked from the
translation process. After translation is complete, a postprocessing program removes the
WHAT IS CLAIMED IS
1. A method for translating a document, comprising the steps of:

providing a character stream including codes and data characters in a first
language;

transmitting said character stream to a language translator;

recognizing said codes to prevent translation of said codes by said language
translator; and

translating a significant portion of said data characters into a second language
using said language translator.

2. The method of claim 1, wherein said codes are HyperText Markup Language codes.

3. The method of claim 1, wherein said step of recognizing said codes is performed by a
language translator preprocessor.

4. The method of claim 1, wherein boundary markers are placed around said codes to
prevent translation of said codes by said language translator.

5. The method of claim 1, wherein said language translator is integrated into a browser
program.

6. The method of claim 1, wherein said document is pretranslated.

7. The method of claim 1, further comprising the step of viewing said translated
HyperText Markup Language document with a browser.

8. A document translation system, comprising:

a character stream containing codes and data characters in a first language;
a preprocessor for marking codes in said character stream;
a language translator for translating into a second language said data characters
in said preprocessed character stream; and

<html>

<head><title>NLT Home</title></head>

<br><center><img src="file:///n|/iowebsrv/server/8100-1.1/server-1/image/ntl.jpg" height=60 width=640></center>
<hr>

<center><h1> </h1></center>

<br><img src="file:///n|/iowebsrv/server/8100-1.1/server-1/image/nltprod.jpg" height=80 width=96>
<h1>NLT Products</h1>

<br><UL><LI><a href="http://www.compuserve.com/cgi-bin/[illegible]">World Community Forum</a></UL>

<br><UL><LI><a href="http://www.compuserve.com/cgi-bin/[illegible]">CompuServe Document Translation Service (CDTS)</a></UL>

<center><h1> </h1></center>

<br><img src="file:///n|/iowebsrv/server/8100-1.1/server-1/image/wip.jpg" height=80 width=96>
<h1>Works-In-Progress</h1>

<br><UL><LI><a href="nlttest.htm">Experimental Area (Enter at your own risk...)</a></UL>
<br><UL><LI><a href="llab.htm">Language Lab</a></UL>
<br><UL><LI><a href="mailtran.htm">E-Mail Translation</a></UL>
<br><UL><LI><a href="webtrans.htm">Web Page Translation</a></UL>
<center><h1> </h1></center>

<br><img src="file:///n|/iowebsrv/server/8100-1.1/server-1/image/[illegible]" height=80 width=106>
<h1>Proposals</h1>

<br><UL><LI><a href="[illegible]">Rose Colored Glasses</a></UL>
<center><h1> </h1></center>

<br><img src="file:///n|/iowebsrv/server/8100-1.1/server-1/image/[illegible]" height=80 width=106>
<h1>Points of Interest</h1>

<br><UL><LI><a href="http://www.willamette.edu:80/~tjones/Language-Page.html">The Human Languages Page</a></UL>
<br><UL><LI><a href="http://www.ai.mit.edu">MIT Artificial Intelligence Lab</a></UL>
<center><h1> </h1></center>

<br><img src="file:///n|/iowebsrv/server/8100-1.1/server-1/image/mailto.gif" height=20 width=27>
<p>Send comments/questions to: </p>

<br><a href="mailto:jlammers@csi.compuserve.com">NLT Mailbox</a>

</body>

</html>